Machine learning can aid in prediction of IDH mutation from H&E-stained histology slides in infiltrating gliomas

While Machine Learning (ML) models have been increasingly applied to a range of histopathology tasks, there has been little emphasis on characterizing these models and contrasting them with human experts. We present a detailed empirical analysis comparing expert neuropathologists and ML models at predicting IDH mutation status in H&E-stained histology slides of infiltrating gliomas, both independently and synergistically. We find that errors made by neuropathologists and ML models trained using the TCGA dataset are distinct, representing modest agreement between predictions (human-vs.-human κ = 0.656; human-vs.-ML model κ = 0.598). While no ML model surpassed human performance on an independent institutional test dataset (human AUC = 0.901, max ML AUC = 0.881), a hybrid model aggregating human and ML predictions demonstrates predictive performance comparable to the consensus of two expert neuropathologists (hybrid classifier AUC = 0.921 vs. two-neuropathologist consensus AUC = 0.920). We also show that models trained at different levels of magnification exhibit different types of errors, supporting the value of aggregation across spatial scales in the ML approach. Finally, we present a detailed interpretation of our multi-scale ML ensemble model which reveals that predictions are driven by human-identifiable features at the patch-level.


Machine learning can aid in prediction of IDH mutation from H&E-stained histology slides in infiltrating gliomas
Benjamin Liechty 1,6 , Zhuoran Xu 1,6 , Zhilu Zhang 2,6 , Cheyanne Slocum 3 , Cagla D. Bahadir 4 , Mert R. Sabuncu 2,5,7* & David J. Pisapia 1,7* While Machine Learning (ML) models have been increasingly applied to a range of histopathology tasks, there has been little emphasis on characterizing these models and contrasting them with human experts. We present a detailed empirical analysis comparing expert neuropathologists and ML models at predicting IDH mutation status in H&E-stained histology slides of infiltrating gliomas, both independently and synergistically. We find that errors made by neuropathologists and ML models trained using the TCGA dataset are distinct, representing modest agreement between predictions (human-vs.-human κ = 0.656; human-vs.-ML model κ = 0.598). While no ML model surpassed human performance on an independent institutional test dataset (human AUC = 0.901, max ML AUC = 0.881), a hybrid model aggregating human and ML predictions demonstrates predictive performance comparable to the consensus of two expert neuropathologists (hybrid classifier AUC = 0.921 vs. two-neuropathologist consensus AUC = 0.920). We also show that models trained at different levels of magnification exhibit different types of errors, supporting the value of aggregation across spatial scales in the ML approach. Finally, we present a detailed interpretation of our multi-scale ML ensemble model which reveals that predictions are driven by human-identifiable features at the patch-level.
With the advancement of computer processing power and the demonstrated utility of deep learning approaches across multiple data-rich domains, the adoption of machine learning to medical diagnostics is anticipated to have a transformative effect on patient care. Already, methylation-based machine learning (ML) approaches to the classification of tumors of the central nervous system (CNS) have demonstrated performance that can exceed traditional histology-based diagnosis 1 , and has allowed for the identification of novel entities 2 and molecular subtypes within established classification systems [2][3][4] . Molecularly-defined entities continue to emerge, many demonstrating overlapping histology with other established tumor classes 5 . However, routine histopathologic examination remains the mainstay of oncologic diagnosis due to its low cost, ubiquity, limited availability of molecular testing, and established robustness-particularly when performed by experienced subspecialty expert histopathologists. Even in healthcare centers with access to advanced molecular assays, the availability of subspecialty experts needed to perform organ-specific histopathologic examination and integrate molecular results into the overall diagnostic picture may be lacking. Developing robust machine learning models that leverage the immense, data-rich trove of existing and prospective histology slides via digitally scanned whole slide images (WSI) and that reproduce or augment subspecialist histopathology expertise can (1) help general pathologists render accurate subspecialty diagnoses, (2) serve as a check on human sources of error by acting as a highly reproducible and fatigue-free assistant, (3) help prioritize the highest yield assays for a given specimen, reducing costs and tissue expenditure, and (4) reveal discordant biases between ML models and human pathologists, which when approached synergistically could increase the detection of clinically pertinent biomarkers more reliably Single-scale ML models make distinct errors relative to each other and to humans. Comparisons of patient-level predictions of the pathologists and classifiers using the WCM data are shown in Fig. 3.  Figure 3B shows a scatter plot of the pathologist consensus score (averaged semiquantitative predictions of the pathologists) compared to the MSE predictions. The correlation between MSE and pathologist consensus is less than between the two pathologists (Pearson coefficient R = 0.674), and correspondingly there is a lower degree of concordance between the binary classifications (Cohen's kappa = 0.598). Among discordant cases, there is a slight enrichment of IDHmut cases that are accurately predicted by the pathologists and missed by the MSE, while there is slight enrichment of IDHwt cases accurately predicted by the MSE and missed by the pathologists. Figure 3D shows patient-level IDH prediction scores from the single-scale and multi-scale ensemble classifiers, pathologists, and hybrid predictions, highlighting the orthogonal nature of errors made at individual levels of magnification. A heatmap of the slide-level predictions from each classifier is shown in Supplemental

Patch-level predictions reveal features that drive accurate and inaccurate predictions.
To gain insight into (1) the decision-related morphological features of the ML models and (2) the types of errors made by both the classifiers and pathologists, sliding patch-level IDH predictions were generated for selected slides using the MSE, three of which will be examined in further detail here (Fig. 4). In the first informative case ( Fig. 4A-D), neuropathologists were correct in predicting IDH mutation, but the case was inaccurately predicted by the MSE to be IDHwt at the slide-level. Regions shown in yellow (Fig. 4C) were predicted by the MSE as consistent with IDH mutation, and were also recognized by the neuropathologists as harboring relatively hypercellular infiltrating tumor that was likely IDH-mutant. Regions encoded in blue (Fig. 4D) drove the overall slide-level misclassification of MSE. These regions were enriched in brain parenchyma without definitive infiltration by tumor cells (as determined by human examination) and were disregarded as non-contributory to the classification task by the neuropathologists. Although the classifier was correct in determining that these areas were not enriched for IDH-mutated tumor, the binary classification task of determining the slide's overall IDH status was evidently hampered by the large presence of uninvolved brain. In a second case ( Fig. 4E-J), that of an IDHmut glioma that was inaccurately classified by both the neuropathologists and the MSE, many regions harbored a relatively monomorphic gemistocytic cytomorphology (Fig. 4G). These regions were accurately interpreted by the classifier as consistent with IDHmut status, and in retrospect also likely would have been favored to represent IDH-mutated tumor to the neuropathologists if presented in isolation. However, one region of marked nuclear pleomorphism (4J) was interpreted by both the classifier and the neuropathologists as representing IDHwt tumor, driving the misclassification. Human-determined 'uninformative regions' again drove inaccurate MSE classification of particular areas: regions of uninvolved www.nature.com/scientificreports/ brain but with increased white-space around individual neurons and vascular channels due to tissue processing artifacts and/or edema were predicted as IDHmut (4H) by the MSE, while regions of relatively uninvolved brain and without significant intraparenchymal white-space (4G) were again erroneously predicted as IDHwt as before.
The final example ( Fig. 4K-P) illustrates an IDHmut glioma inaccurately predicted by the neuropathologists as IDHwt, but correctly predicted by the MSE. In this case, solid regions of tumor (4M and 4N) were accurately predicted by the ML classifier as areas with (IDH-mutated) tumor. A large area of necrosis was present in this slide (4O), which drove inaccurate prediction of IDHwt by both neuropathologists, and this area in isolation was also classified as IDHwt by the MSE. Once again, the MSE interpreted regions of minimally involved normal brain (  www.nature.com/scientificreports/

Patch-level embedding vectors reflect diagnostically relevant human-identifiable features.
To gain further insight into the histological features encoded by our trained ML models, 5 random patches were selected from each slide in the WCM dataset and uniform manifold approximation and projection (UMAP) was performed on the patch-level embedding vectors from the best performing 10x-scale classifier ( Fig. 5A-B). Review of histological features in clustered patches revealed consistent patterns across patches obtained from distinct slides. Emergent human-identifiable features included: (1) microcystic architecture (Fig. 5C), which is correlated with IDHmut status, (2) hypercellular regions of tumor with round, monomorphic nuclei, reminiscent of oligodendrocytes, which were appropriately enriched for IDHmut tumors (Fig. 5D), (3) hypercellular tumor areas with spindled nuclei and greater pleomorphism, enriched for IDHwt tumors (Fig. 5E), and (4) brain parenchyma without significant human-detectable involvement by tumor (by H&E), that were predicted by the classifier as harboring IDH mutation irrespective of the ground truth slide-level class (Fig. 5F). Other features captured by the embedding vectors include patches with a significant amount of whitespace (Fig. 5A, top-right) and regions with abundant hemorrhage or necrosis (Fig. 5A, top center). The ground-truth IDH-status and integrated molecular diagnosis of the UMAP coordinates are shown in Supplemental Fig. 6. A high-resolution version of Fig. 5A is available upon request.

Discussion
Just as the molecular classification of neoplastic disease and its impact on patient care have emerged rapidly in the last decade, ML techniques and computational resources continue to progress. An open question in medical diagnostics is whether existing data-rich resources such as WSIs, effectively encodes untapped information that could be leveraged to guide patient management while minimizing the use of more advanced but less accessible modalities. We selected the task of predicting IDH mutation in infiltrating gliomas as a prototypical problem within this space, using CNN models with H&E-based histological information as the sole input. Moreover, we compared the performance of this task over multiple magnification scales. As a reference point, we compared the performance of the CNN models, trained on the order of hours, with those of subspecialty-trained expert neuropathologists, trained on the order of years to decades. While the models demonstrate very high accuracy on the TCGA dataset, a significant drop in performance is seen when applying the same models to the WCM dataset. We believe this is a result of recurrent batch effects in tissue fixation, processing, and staining between different laboratories, and that the overperformance on TCGA data likely represents an element of overfitting www.nature.com/scientificreports/ on batch effects from a relatively small number of centers, and the performance on the WCM dataset is likely more representative of the performance that would be seen real-world deployment on samples from laboratories to which the models are naïve. Comparison of ML model predictions to expert pathologists shows that while similar degrees of accuracy are obtained on the classification task, the types of errors made were distinct, combining pathologist predictions and ML predictions results in greater classification robustness than either alone. Manual interrogation of patch-level predictions demonstrates several confounders exploited by the ML models which, interestingly, were found to be reproducible at all levels of magnification. The most striking source of errors in our models were regions of human-interpreted low informativity within the underlying tissue. Specifically, regions of brain without definitive tumor cells were often classified as IDHwt, while regions with increased white-space secondary to vacuolation, edema, and/or tissue artifacts were often classified as IDHmut. While areas lacking tumor are indeed IDHwt per se given the putative absence of tumor cells, the task was built around slide-level classification of de facto tumors. Our interpretation, therefore, is that classifying these regions as IDHwt on-average drove the classifier to a higher degree of accuracy overall, despite patch-level 'uninformativeness' as determined by human www.nature.com/scientificreports/ observers. One approach to address the confounding effects of such regions is to explicitly annotate and train toward a third-class label, that of "non-neoplastic brain" from autopsy and epilepsy cases. Surprisingly, in the set of sliding window heatmaps analyzed, the models were not clearly driven by features that pathologists often used to predict IDH-class due to their enrichment in IDHwt tumors, such as well-formed palisading necrosis and microvascular proliferation. The presence of human-identifiable features as seen in the UMAP projections demonstrates that the CNNs can recognize some of the features used by humans in the classification of gliomas. We found that patches demonstrating microcystic architecture or oligodendroglial cytomorphology were enriched for IDHmt classification while patches with increased spindled cells and pleomorphism were enriched for IDHwt classification. Histomorphologic correlates of certain driver alterations have been previously identified, such as giant-cell morphology in IDHwt glioblastomas harboring TP53 mutation, and epithelioid morphology in high grade gliomas harboring BRAF mutations; however, given the heterogeneity of infiltrating gliomas, and particularly in IDHwt astrocytomas/glioblastomas, these morphologic correlates as assessed by human pathologists have relatively poor predictive utility [32][33][34][35][36][37] . The UMAP also clearly illustrated that regions of human-interpreted low informational value relative to the task were enriched for particular classes, such as normal appearing brain being enriched for IDHwt class. Again, the identification of recurrent confounders across these models suggests that strategies to devalue or exclude uninformative patches could further improve classification accuracy, and expanding the number of available classes to include non-neoplastic samples, as alluded to above, may improve ML performance. In addition, we believe that given a sufficiently large dataset of histologic data paired with RNA transcriptome and DNA methylation profiling, histomorphologic correlates may be identified, however further studies will be necessary to assess for this. Methods to aggregate patch-level predictions into slide-level classifications are a widely studied problem in the multiple-instance learning literature. Attention mechanisms that increase the weight of highly informative patches on the final classification prediction have been found to be useful in other cancer types 22,38 . However, the differing biological characteristics of tumor types that are reflected in histology (for example that infiltrating gliomas typically have an ill-defined border with respect to the surrounding non-neoplastic tissue, a feature that differs significantly from that of epithelial cancers) are likely to impact the efficacy of any particular ML algorithm, and the strategies employed are unlikely to be universally applicable to models trained for all diagnostic tasks. In our experiments conducted with this dataset, we also tested attention pooling mechanisms to aggregate patch-level embeddings into slide-level embeddings using the method described by Ilse et al. 38 (https:// arxiv. org/ pdf/ 1802. 04712. pdf), where weights for each patch embedding are learnable; however, this attention mechanism did not provide a significant improvement on classification performance relative to naïve averaging of embedding weights (data not shown). That said, as the number of potential target outputs of the model increases, attention mechanisms may help boost performance, but future studies using a broader variety of target classes are necessary to better assess this. Our results demonstrate that the level of magnification used for input images impacts the ML model accuracy, with the greatest levels of accuracy achieved at intermediate levels of magnification (corresponding to 10 × objective in our study). One interpretation of this finding is that while lower levels of magnification provide a larger field of view with a greater degree of overall tissue sampling and increased architectural information, higher levels of magnification provide increased cytologic detail yet with a smaller field of view. Intermediate levels of magnification may represent a "sweet-spot" capturing both low-power and high-power information. Of practical importance, some errors made by models using different levels of magnification were found to be orthogonal, and to the errors made by human observers, providing a rationale for multi-scale ML models and hybrid MLhuman approaches. We also believe that designing ML models to explicitly recapitulate the human methodology of examining the tissue at lower power, and then selecting regions of interest to interrogate at higher power could result in more robust model predictions, while also using less computational resources than interrogating an entire image at high power. However, future studies will be necessary to confirm this.
While routine H&E staining is not used to make determinations of mutational status, its global availability, low cost, and diagnostic richness have established it as a mainstay of surgical pathology for over a century. Immunohistochemistry for the most common pathogenic IDH mutation (IDH1 R132H) detects 85-90% of IDH mutant gliomas, with the remainder requiring DNA sequencing to identify. Of note, in our cohort, all slides harboring non-R132H IDH-mutations were correctly classified by our models (n = 6, IDH1 R132C = 4, IDH2 R172K = 2). This work suggests that computer vision-based approaches may assist in subclassification of tumors for which gold-standard molecular diagnostics are not universally available and in selecting assays for additional testing.
Studies evaluating at the ability of ML models to predict IDH mutation status have been previously published. Jiang et al. 39 found that WSI could be used to predict IDH mutation status and survival in gliomas with grade 2 and 3 histology. Liu et al. found that the inclusion of a generative adversarial network (GAN) to augment training data and including patient age as a model input could both improve model accuracy. Both these studies used TCGA glioma cohorts for model training. To our knowledge, our work is the first to evaluate aggregate expert human predictions with model predictions, and to compare the features learned by ML models with those that have been identified as predictive by human pathologists, and to compare the predictions of ML models at multiple levels of magnification. Further studies will benefit from larger image slide datasets including greater variability of laboratory-specific staining protocols. In addition to training models to detect particular clinicallyrelevant molecular alterations, of interest will be to train models directly toward patient outcomes in an effort to disclose previously unappreciated histological features of clinical and prognostic relevance.
This study demonstrates that ML models can achieve near human-level performance at predicting clinically relevant oncologic biomarkers of CNS tumors using H&E-based histological information alone, even with a completely external test set, with training times and slide exposure that is minimal compared to that needed www.nature.com/scientificreports/ to train human subspecialty experts. Moreover, by analyzing single magnification and multi-scale models and interrogating encoded features through heatmap and UMAP visualizations of patch-level predictions, crucial insights of how to iteratively improve the ML models can be obtained. Our study represents a proof-of-principle that ML models hold great promise in approaching and potentially superseding human level performance of biomarker detection via deep learning of widely accessible H&E slides, paving the way to uncovering the full diagnostic and prognostic potential of this ubiquitous data modality.

Materials and methods
Human subjects research. This research and experimental protocols were conducted in accordance with Weill Cornell Medicine's Institutional Review Board requirements under the IRB-approved protocol #1312014589. All patients were initially consented to surgical procedures from which slide image data was obtained as per institutional guidelines. The IRB for this research itself was approved with a waiver of consent given its retrospective nature and given there was no contact with patients, and the research conforms to the ethical requirements and HIPAA compliant protections mandated by the institutional IRB.
Dataset. In this study, we used datasets from two cohorts of infiltrating gliomas patients obtained from The Cancer Genome Atlas (TCGA) 40 and Weill-Cornell Medicine (WCM Image preprocessing. We first tiled all WSI into non-overlapping patches of size 256 by 256 pixels at spatial resolutions corresponding to 2.5×, 5×, 10×, and 20× magnification (Fig. 1A). Pixel values ranging between 40 and 215 in greyscale space were treated as informative tissue, and pixels outside this range were considered uninformative, either as background whitespace (> 215) or folded tissue (< 40). Only patches with over 75% tissue percentage were kept for further training and testing. All patches with significant blurriness or pen marks were excluded by thresholding RGB values obtained heuristically.
Image augmentation. To increase the model generalizability and reduce potential overfitting, we implemented several image augmentation strategies during training. Since all patches within each batch were from one WSI, color augmentations were performed on slide level for each iteration, i.e., we only used one set of color augmentation parameters each iteration for all patches from each slide. We first transformed RGB patches into HSV color space. Then pixel values were augmented channel-wise as: I aug c = α c I c + β c . I c were pixel values in channel c . α c and β c were channel specific color augmentation factors. α c and β c were sampled uniformly from U(1 − σ , 1 + σ ) and U(−σ , σ ) respectively for each slide. We set σ as 0.05 to control augmentation degree. In addition, each patch had 50% probability of being flipped either vertically or horizontally and equal probability (25% each) of being rotated by 0, 90, 180 or 270 degrees. Distinct augmentation parameters were randomly generated during patch selection for each mini-batch.
Model training. After the image preprocessing step, each WSI had four sets of patches corresponding to magnifications of 2.5×, 5×, 10×, and 20×. Single-scale models were trained for each scale. We used a pre-trained DenseNet-121 architecture 31 , without the last dense layer, as the feature extractor to generate patch-level embeddings of length 1024. All patch-level embeddings from one slide generated in each iteration were aggregated into slide-level embeddings using average pooling. A randomly initialized fully connected layer with 1024 nodes was then implemented to take the aggregated slide-level features as input and output slide-level IDH mutation probabilities. Due to memory constraints, only 200 patches from one WSI were randomly selected and passed to the network for each training step (Fig. 1B). If there were less than 200 patches for one slide, we used all available patches in that mini-batch. Note the mini-batch consisted of a single WSI. To keep IDH classes balanced during training, we randomly sampled 140 IDHwt slides and used all 140 IDHmut slides in each training epoch. We used Adam as to minimize binary cross-entropy loss 41,42 with a learning rate of 0.00001, and a maximum of 100 epochs 43 . All network parameters, including the weights of the DenseNet-121 backbone were updated during training. Models from the epoch with the best validation loss were used. Three separate single-scale models were trained using different random initial seeds. www.nature.com/scientificreports/ Model inference. The trained models from the last step can be used for predicting both patch-level and slide-level IDH mutation status. We first averaged the three slide-level probabilistic predictions at a given scale to compute single-scale predictions. A multi-scale ensemble (MSE) was then computed by averaging all four single-scale predictions (Fig. 1C). For patients with multiple slides, patient-level predictions were computed by averaging slide-level predictions. For measures of prediction accuracy, a threshold of 0.5 was used as a cutoff for IDHmut status.
Pathologist evaluation. The WCM test set was separately evaluated by two neuropathologists (authors BL and DP), blinded to all patient information and ancillary testing beyond the WSI, to compare the model predictions to human observers. For each case, both pathologists were asked to issue a prediction for IDH status in a semiquantitative scale, normalized to a range of 0 and 1 (i.e., 0 for a prediction of IDHwt and 1 for IDHmut, values close to 0.5 for cases with low certainty). The pathologists' predictions were then averaged to generate a two-pathologist consensus score. The predictions from each pathologist were averaged with the MSE prediction to generate hybrid classifier scores, and the two-pathologist consensus score was averaged with the MSE predictions to generate a two-pathologist consensus-hybrid model. UMAP visualization. We randomly selected five 10× patches from each WSI in the WCM test set for UMAP visualization 44,45 . Patch embeddings extracted by trained convolutional base of the best performing 10× classifier were used as patch representations. We used the Python UMAP package with default hyper-parameters to obtain the UMAP representations for each patch. For visualization purposes, the first two dimensional vectors of UMAP projections were used as coordinates to show the original input patches, ground-truth IDH mutation status, ground-truth integrated molecular diagnosis (oligodendroglioma, IDHmut astrocytoma, IDHwt astrocytoma), patch-level IDH prediction scores, and slide-level IDH prediction scores from the classifier. The patches were then reviewed by the pathologists to determine the presence of human-identifiable features in each clustering, and the association between histomorphology with specific diagnoses.
Statistical analysis and software. All model trainings and inferences were performed on 4 NVIDIA Titan X GPUs. Image preprocessing, model training and inference were conducted in Python, version 3.7.4. OpenSlide python was used for reading and tiling WSI. Pytorch was used for training neural networks. All statistical analyses were performed in R, version 4.0.3. Slide prediction heatmaps were plotted using the Complex-Heatmap R package 46 . Age differences were evaluated using t-test. Chi-square test was used to test the gender difference between two IDH status groups. Confidence intervals of model performance metrics were evaluated through sample bootstrapping for 1000 times. All statistical tests were two-sided with a significance threshold of p < 0.05.
Significance. We show that combining an expert pathologist's assessments with ML model predictions can classify IDH mutation status in infiltrating gliomas at a comparable level to two-expert consensus. Our study is a proof of principle for the broader application of ML models in deriving clinically relevant molecular markers based on histopathology alone. We also demonstrate that ML-based histopathology classification accuracy varies with level of magnification, and discordant errors are made across scales. This suggests value in ensembling across levels of magnification.

Data availability
All TCGA histology image data used in this study is publicly available through https:// portal. gdc. cancer. gov/ repos itory. De-identified metadata corresponding to the WCM histology image dataset (WSI database) is available upon request. Raw scanned *.svs files corresponding to the WCM histology image dataset may be shared in accordance with institutional guidelines including development of an institution-specific Materials Transfer Agreement and in accordance with appropriate HIPAA-compliant interinstitutional IRB-approved protocols.

Code availability
All source code and guidelines are publicly available on https:// github. com/ Karen xzr/ IDHmut.